Metrics for measuring distances in configuration spaces 
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In order to characterize molecular structures we introduce configurational fingerprint vectors 
which are counterparts of quantities used experimentally to identify structures. Components of 
such vectors can be associated to individual atoms and can then serve as an atomic fingerprint that 
identifies an atom within a structure. The Euclidean distance between the configurational fingerprint 
vectors satisfies the properties of a metric and can therefore safely be used to measure dissimilarities 
between configurations in the high dimensional configuration space. We show that these metrics 
correlate very well with the RM SD between two configurations if this RMSD is obtained from a global 
minimization over all translations, rotations and permutations of atomic indices. We introduce a 
Monte Carlo approach to obtain this global minimum of the RMSD between configurations where 
atomic fingerprints are used to enhance the performance of the procedure. 



Quantifying dissimilarities between molecular struc- 
tures is an essential problem encountered in physics and 
chemistry. Comparisons based on structural data ob- 
tained either from experiments or computer simulations 
can help identifying or synthesising new molecules and 
crystals. Diversity analysis is at the heart of any struc- 
ture prediction method in material science and solid state 
physics [H-Q and conformer search in structural biology 
and drug discovery I6l4l2l . In the latter case, most of the 
proposed approaches jl3-fl5j use approximate methods 
that reduce the structure description information, e.g. 
by excluding the side chains in a protein or a two di- 
mensional representations of the molecule [l6j], to speed 
up the searching procedure [It}- In the case of solid 
state physics, fairly accurate dissimilarity measures are 
required. Within the structure prediction methods based 
on the evolutionary algorithms [l| , the required diversity 
of populations can only be maintained if strongly similar 
configuration are eliminated. Within the minima hop- 
ping structure prediction method Q, an identification 
of identical configurations is required as well to prevent 
trapping in funnels that do not contain the global min- 
imum. Some machine learning approaches |l8[ are also 
based on similarity measures. 

It is natural to characterize the dissimilarity between 
two structures p and q by a real number d(p, q) > 0. 
In order to give meaningful results d(p, q) should sat- 
isfy the properties of a metric, namely (i) coincidence 
axiom: d(p, q) = if and only if p = q, (ii) sym- 
metry: d(p,q) = d(q,p) and (hi) triangle inequality: 
d(p,q) + d{q,r) > d(p,r). The coincidence axiom en- 
sures that two configurations p and q are identical if their 
distance is zero. The triangle inequality is essential for 
clustering algorithms. If it is not satisfied, then it could 
happen that a configuration that is within one cluster 
is also part of another cluster even though the distance 
between the two clusters is very large in configuration 
space. 

In this article we will introduce a family of metrics 



for measuring distances in configuration spaces. We will 
show the performance of these metrics for a representa- 
tive set of benchmark systems including covalent, metal- 
lic and organic structures. Each configuration is a low 
energy metastable configuration obtained during a struc- 
ture search using the Minima Hopping Method [2( on the 
density functional theory (DFT) level as implemented in 
the BigDFT code \v^. 

A configuration of n alike atoms is uniquely repre- 
sented by R = (t*i, r2, • • ■ , r n ) £ M 3x ", where the column 
vector Ti represents the Cartesian coordinates of atom i. 
A distance induced by the naive Frobenius norm 
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can not be used to compare two configurations p and 
q, because it is not invariant with respect to transla- 
tion or rotation of one configuration relative to the other. 
For this reason the commonly used root-mean-square dis- 
tance (RMSD) is defined as the minimum Frobenius dis- 
tance over all translations and rotations. By minimizing 
Y^i \\ r i + d — r i II 2 with respect to translation d one ob- 
tains X£(rf + d- rf) = and thus d = ± £ r? - ± £ if 
centralizes p on the centroid of q independently of their 
orientations. Therefore we will assume that all T*j are 
measured with respect to the centroids. Then, finding 
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where U is the rotation around the common centroid, is a 
local minimization problem [20I l2lj and hence we denote 
this version of the RMSD by RMSD;. It can be solved 
easily 12 2| as discussed in detail in the Supplementary 
Material [23j|. 

The RMSD; is however not invariant under index per- 
mutations of chemically identical atoms. If the configu- 
ration p and q arc identical, Eq. @ will be different from 
zero if we permute for instance in R q the positions and 
r q - of atoms i and j . The minimum Frobenius distance 
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obtained by considering all possible index permutations 
for an arbitrary rotation U is 
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P being an n x n permutation matrix. This assignment 
problem is solved in polynomial time using the Hungar- 
ian algorithm (HA) |24j. However, what is really needed 
is a solution of the combined problem of the global min- 
imization over all rotations and permutations, namely 
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The global minimum RMSD fulfills all the properties of 
a metric. The coincidence and symmetry properties are 
easy to see and for the triangle inequality we have 



RMSD(p : q) + RMSD(q, r) 
= ^=\\U pq W>P pq -R q \\ + - 
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\/n 
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= RMSD(p,r). 

Since U and P are not independent, no algorithm ex- 
ists which can find the global RMSD within polynomial 
time. Just doing a search by alternating rotation and 
permutation steps is not guaranteed to converge to the 
global minimum with a finite number of steps. Trying 
out all possible permutations would lead to a factorial 
increase of the computing time with respect to n and 
this approach is therefore not feasible except for very 
small systems. We use a two-stage method for finding 
the global RMSD with moderate computational effort. 
In the first stage we try to find the optimal global align- 
ment of the two structures being compared. Since all 
these global alignment methods are empirical and can 
fail we apply several of them successively. The first align- 
ment we apply is along the principal inertia axes of the 
molecules and the other ones are along a series of the 
axes set defined by vectorial atomic fingerprints and are 
discussed in detail in the Supplementary Material [23| . 
A trial alignment is always followed by application of the 
HA to find the index permutation that gives the small- 
est RMSD [ID]. This procedure has allowed us to detect 
all identical configuration in this first stage, as seen in 
Table H 

If a small enough RMSD is not found, we enter into 
an iterative stage where randomly chosen atoms are per- 
muted within a thresholding Monte Carlo (MC) approach 
followed by applying the optimal rotation. Instead of a 
completely random switching of atoms, we steer it such 
that atoms with similar environments (i.e. fingerprints) 
are permuted with higher probability. For instance, a 



TABLE I. Number (#) of remained distinct configurations 
and average RMSD (in A) with our RMSD global minimiza- 
tion algorithm. Evolution of the values is given in the Sup- 
plementary Material [23l |. 
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317 


1.40 
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3.44 
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2.75 
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184 


1.02 


59 


1.01 


42 


1.78 


MC 


184 


.791 


59 


.824 


42 


1.51 



pair of atoms will rapidly be flipped if they both reside 
cither on the outermost layer or inside a cluster. The bi- 
asing of permutations is reduced gradually if no sufficient 
progress is made. 

While the RMSD can be considered as the most basic 
quantity to measure the dissimilarities, finding the global 
minimum RMSD is numerically costly. In the following 
we will therefore introduce a family of metrics which are 
cheaper to calculate than the global RMSD yet in excel- 
lent agreement with it. We consider vectors V of length 
N formed from all the eigenvalues and / or selected eigen- 
vectors of symmetric N x N matrices whose elements 
depend only on the interatomic distances = | j r ^ — r ^ 1 1 
of an n-atom configuration. The whole vector forms a 
structural fingerprint and therefore the normalized Eu- 
clidean distance 



A v (p,q) = -^=\\VV-V<'\ 
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measures the dissimilarly between p and q with no need 
to superimpose them. When V is formed from the el- 
ements of selected eigenvectors, then the each element 
can be associated to an atom and the ensemble belong- 
ing to one atom forms an atomic fingerprint or descriptor 
of the local environment of the atom. Since the matrix 
depends only on interatomic distances, the same holds 
true for the eigenvalues and eigenvectors, and V is thus 
invariant under translations, rotations and reflections of 
the configuration. In order to make A\r also independent 
of the atomic indices, the elements of each V are always 
sorted in an ascending order. 

Even though the eigenvalue vector is much shorter than 
the vector containing all matrix elements, the fingerprint 
distances based on the eigenvalues are better than those 
obtained by sorting all the matrix elements into a vector. 
One can in some cases construct distinct so-called homo- 
metric configurations j2(| for which the fingerprint vec- 
tors of the sorted matrix elements are identical whereas 
the eigenvalue vectors are not identical and allow thus 
to distinguish between them. In addition, our empiri- 
cal results of Fig. [T] show that the gap between identical 
and distinct pairs is larger for the eigenvalues than for 
the sorted matrix elements. Because the relaxations were 
stopped when the force on each atom is within 0.01 eV/A, 
identical configurations arc in practice identical only up 
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and permutations of atoms. Each KS eigenvalue de- 
pends parametrically on the 3n — 6 internal degrees of 
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FIG. 1. Correlation of the Euclidean distance between the 
vectors consisting of all of the sorted elements (a) or eigenval- 
ues (b) of OM and RMSD for f 000 metastable configurations. 
Each data point represents one pair comparison. 



to some hnite precision. Two configurations are consid- 
ered to be identical (i.e. belong to the same cluster) if 
their distance is below a certain threshold. The thresh- 
old is chosen unambiguously only if a gap exists in the 
distance space. 

Various n x n matrices have been used previously to 
characterize molecular configurations. The contact ma- 
trix exhibits discontinuities when the atomic distances 
cross the cutoff radius. By introducing a smooth cutoff 
these discontinuities have been removed and the result- 
ing matrix has been used as a fingerprinting tool p7| . 
Presumably not only the contact matrix but also other 
matrices from spectral graph theory such as the Laplace 
matrix could be used in a similar way. The problem is 
however that all quantities derived from such nxn matri- 
ces do not satisfy the coincidence axiom and can thus fail 
to detect structural differences. This has been shown for 
the Coulomb matrix (28j | and for other nxn matrices it is 
shown in our Supplementary Material [231 ] . Unique chem- 



ical environment descriptors can however be obtained by 
adding information about the radial distribution of the 
neighbours [29L [30| . 

In the following we will describe several matrix con- 
structions which can be used for fingerprinting. These 
matrices are closely related to measurable quantities that 
are traditionally used by experimentalists to identify 
structures. 

Emission and absorption spectra arise from transitions 
between discrete electronic energy levels. Each element 
has its characteristic energetic levels and therefore atomic 
spectra can be used as elemental fingerprints. When 
atoms are assembled into structures the electronic states 
of the constituent atoms are modified depending on the 
arrangement of the atoms. A computational analogue to 
electronic energy levels probed by various spectroscopic 
experiments are the Kohn-Sham (KS) energy eigenval- 
ues, even though they do not represent the physical ex- 
citation energies. Since the Hamiltonian matrix depends 
only on the interatomic distances, the sorted KS eigen- 
values are invariant to translations, rotations, reflections 



freedom of the molecule. The condition ej 



e? can be 



satisfied for two different configurations p and q for all 
i = 1,...,N only if the number of conditions N is less 
than the number of degrees of freedom. If N > in — 6 all 
conditions can therefore in general not be satisfied and 
the coincidence axiom holds. If one considers a subset 
of the configuration space, such as all metastable con- 
figurations, it turns however out that in practice even 
eigenvalue vectors which are shorter than 3n — 6, give a 
good fingerprint, since it is unlikely that two configura- 
tions in the physically accesiblc regions of the potential 
energy surface will be in the subspace that leaves all the 
eigenvalues invariant. 

We examine fingerprints that are based on the oc- 
cupied KS eigenvalues only as well as fingerprints that 
are based both on the occupied and unoccupied eigen- 
values. The former were obtained from the sclfconsis- 
tent eigenvalues obtained in a large wavelet basis flfjj ]. 
whereas the latter were obtained for simplicity from the 
non-selfconsistent input guess eigenvalues obtained in a 
minimal Gaussian type atomic orbitals (GTO) basis set 
for a charge density which is a superposition of atomic 
charge densities. Configuration distances obtained from 
the occupied KS eigenvalues denoted by Aks{p, q) show 
a perfect correlation with the RMSD for all three test 
sets, see Fig. [2] Even though the vector Vgto is m a ll 
cases longer than the vector Vks (eg- two times longer 
in case of the Si cluster), fingerprint distances based on 
the former do not better correlate with the RMSD than 
fingerprint distances based on the latter, even though the 
coincidence theorem is not satisfied. 

A matrix which has similar properties as the Hamil- 
tonian matrix is the overlap matrix (OM) expressed 
in terms of GTOs. Contrary to the Hamiltonian, all 
elements of the OM can easily be calculated analyti- 
cally. 23j In the simplest case where only uncontracted 
s-type GTOs are used, the resulting atomic fingerprint 
consists of a single scalar. Information about the radial 
distribution can be incorporated in the OM by adding p 
and d type GTOs. In this way the fingerprint vector be- 
comes also longer than 3n — 6 and the coincidence axiom 
is satisfied. If the fingerprint is used to calculate distances 
between our test configurations, it however turns out that 
adding p-type orbitals gives only a small improvement 
and adding additional d-type orbitals gives only a very 
marginal improvement. The distances obtained with this 
fingerprint denoted by Aom{p, <?) in Fig. [2] show again a 
very good correlation with the RMSD. The width of the 
GTOs was in all our tests given by the covalent radius of 
the atom on which the GTO was centered. 

The vibrational properties, which are frequently used 
experimentally to identify structures, are closely related 
to the Hessian matrix which consists of the second or- 
der derivatives of the energy with respect to the atomic 
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culation the metric based on the Kohn-Sham eigenvalues 
is a good choice since the eigenvalues are a byproduct 
of any DFT calculation and thus no extra effort is re- 
quired to obtain them. If Kohn-Sham eigenvalues are 
not available, the method based on the eigenvalues of 
the overlap matrix is recommended, since it leads to ma- 
trices whose elements can be calculated analytically. The 
Hessian matrix of a classical interaction is however also 
a good choice. All the proposed methods are entirely pa- 
rameter free. For the calculation of the overlap matrix a 
length scale is needed but it is universally given by the 
covalcnt radii of the atoms. 

We gratefully thank D.G. Kanhere, S. De, R. Schnei- 
der and R. Ebrahimian for interesting and helpful discus- 
sions. This work has been supported by the Swiss Na- 
tional Science Foundation (SNF) and the Swiss National 
Center of Competence in Research (NCCR) on Nanoscale 
Science. Computing time was provided by the CSCS. 



FIG. 2. Correlation of metrics based on KS, OM (s-only in 
red, both s and p in green) and Hessian eigenvalues with 
RMSD. 



positions. This matrix belongs to the class of matrices 
with the desired properties. For an arbitrary configu- 
ration there are 3n — 3 nonzero eigenvalues which are 
related to the vibrational frequencies. This makes the 
vector just long enough to satisfy the coincidence axiom 
and it is therefore a good fingerprint vector. Unfortu- 
nately the calculation of the Hessian is rather expensive 
in the context of a DFT calculation and can also be cum- 
bersome with sophisticated force fields. We will therefore 
not further pursue approaches based on an Hessian which 
is calculated within the same method as the energy and 
forces. It however turns out that eigenvalues or eigenvec- 
tors of the Hessian matrices which are derived from an- 
other cheaper potential such as the Lennard Jones (LJ) 
potential give also good fingerprints. This is shown in 
Fig. [2] for our test systems after the lengths were scaled 
to the equilibrium bondlength of the LJ potential. 

In summary, we have shown that the RMSD, the most 
natural measure of dissimilarity between two configura- 
tions, satisfies the properties of a metric when is obtained 
by a global minimization over all rotations and index per- 
mutations. We have presented a Monte Carlo method 
to calculate the global minimal RMSD which does not 
require to try out all possible index permutations and 
which is thus computationally feasible. At the same time 
we have introduced other metrics which are much cheaper 
to calculate because they do not require a structural su- 
perposition. Nevertheless they correlate in all our test 
cases extremely well with the RMSD. Within a DFT cal- 
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Iterative global minimization of RMSD 

As stated in the main text, the algorithm of finding 
the global minimum of the RMSD has two major stages. 
The flowchart of the algorithm is depicted in Fig.lSllwith 
the two different stages shown on the left and right sides. 
In the first stage, we first align two of the three principal 
axes of inertia of one configurations with the correspond- 
ing axes of the other one. The positive direction of each 
axis is chosen such that the third moment of the mass 
distribution (skewness) along that axis is positive. One 
can argue that two configurations related by a mirror re- 
flection are identical or not. If the former is the case, 
we try out both positive and negative skewness along the 
third principal axis by applying a mirror reflection. Then 
atomic matching is done in the Cartesian space by asso- 
ciating to each atom i e p the closest atom j £ q such 



that d; 



- ll»P 



rA\ is minimal (the superscript pq is 



dropped for convenience). In other words, the columns 
of the nxn matrix made by dij 's are reordered such that 
its trace is minimal. In terms of the assignment problem, 
this can be solved with a little computation by minimiz- 
ing the cost Q3i<fe) using the Hungarian algorithm [lj. 
With this initial index matching, a rotation using the 
quaternions is applied to refine the molecular alignment. 
If the required rotation is significant, the atomic assign- 
ment should be repeated. It iterates until the atomic 
indices remain fixed after applying the exact rotation. 

The success of the alignment with any kind of axes 
set is not necessarily guaranteed. For example, if the 
inertia tensor has degenerate or near-degenerate eigen- 
values while the structure is not symmetric, the prin- 
cipal axes of inertia is not a good descriptor of the 
molecule orientation. A glassy structure is also a chal- 
lenging case for atomic assignment based on molecular 
alignment because small displacements of particles yield 
a different configuration while the axes of inertia rotate 
only slightly. Therefore a second alignment trial is done. 
Using Wi = SiVi, with Si and Vi being respectively scalar 
and vectorial atomic fingerprints, we define a molecule- 
attached basis set as 



i 
n 



(SI) 
(S2) 



where the sum runs over the atoms and x denotes the 
cross product and represents the positions of atoms 
with respect to the center of mass. First, we align W q 




FIG. SI. Flowchart of the algorithm of global minimization 
of RMSD in two major steps. The loop on the right runs over 
several axes sets and matches atoms of a pair of configurations 
via aligning their molecular axes. The left loop shows the 
Monte Carlo (MC) permutation of particles guided by the 
atomic fingerprint distances. The impact of biassing decreases 
by increasing the permutation probability. The dashed line 
shows that the right process could be excluded. 
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TABLE SI. Number (=ff) of remained distinct configurations 
and average RMSD (in A) during the two stage RMSD global 
minimization. In the first stage (HA), the principal axes of in- 
ertia as well as three molecular axes sets obtained from atomic 
vectorial fingerprints are used. Every molecular alignment is 
always followed by application of the HA to find the optimal 
index permutation. In the second stage (MC), random permu- 
tations are tried out which are followed by local minimization 
to get the optimal rotation. Because of the stochastic nature 
of the MC part, the reported values might change in different 
runs. 









Si32 


Mg 26 


C22 


H24N2O3 






# 


RMSD 


# 


RMSD 


# 


RMSD 


initial 


317 


1.40 


111 


3.44 


60 


2.75 




axes of inertia 


184 


1.16 


60 


1.08 


42 


1.93 


HA 


(W, W') ai 


184 


1.06 


59 


1.06 


42 


1.89 


(W, w% 2 


184 


1.04 


59 


1.03 


42 


1.81 




(W, w% 3 


184 


1.02 


59 


1.01 


42 


1.78 




iter -10" 


184 


.978 


59 


.985 


42 


1.52 




iter -10 4 


184 


.910 


59 


.864 


42 


1.51 


MC 


iter -10 5 


184 


.852 


59 


.852 


42 


1.51 




iter -10 6 


184 


.792 


59 


.824 


42 


1.51 




iter -10 7 


184 


.791 


59 


.824 


42 


1.51 




FIG. S2. Describing atomic environment by scalar and vec- 
torial atomic fingerprints for a Si32 cluster obtained from 
the principal eigenvector of OM matrix with one s and one 
P — (Px,Py,Pz) GTOs (see the text). The gradient color of 
the vectorial (p-type) fingerprint indicates the value (small is 
red and large in blue) of the scalar (s-type) fingerprint 



about 10 seconds on a single 2.4 GHz Intel core. 



with W' p and then rotate q around it such that the plane 
made by (W q ,W' q ) coincides with the plane made by 
(W p ,W' P ). Depending the used Gaussian widths to get 
Si and Vi , several basis sets may be constructed and tried 
one-by-one in this stage. If the alignment of a new axes 
set results in a smaller RMSD, we accept it. In Table [Si] 
we show the results of the alignment of the principal axes 
of inertia as well as three sets of (W,W) axes obtained 
by three different Gaussian widths a. 

The initial permutation given by the atomic matching 
in the Cartesian space may be replaced and/or followed 
by the atomic matching in the atomic fingerprint space. 
In other words, the Hungarian algorithm is used to min- 
imize the cost J2i da where = \\v p — Vj\\. In the sim- 
plest case, i.e. when a single scalar fingerprint is used, 
one sorts the atoms of each configuration with respect to 
their fingerprint values. 

If the structures were really identical either of the 
atomic assignment using HA following axes alignments 
would give the global minimum RMSD, which will then 
be zero. For near identical structures, different types of 
GTO's or even differently defined Wi from ours would 
give an arbitrary number of (W,W) sets to be be tried 
out in order to sample the most favorable region of the 
rotation space. In general, however, the assignment prob- 
lem might need to be treated using our second stage i.e. 
by performing MC permutations of atomic indices. The 
iteration stops when the global minimum RMSD stops 
becoming smaller. The number of required MC itera- 
tions depends on the system size, see Table EU For the 
system with largest number of nonidentical particles, the 
32 atom silicon cluster, 10 6 MC iterations take on average 



Overlaps between GTOs 

In Cartesian coordinates a normalized GTO centered 
at the atomic positions t\ is given by 

4(r) =N t (x- Xi ) l *(y - yj ) l *(z - z^e-^^ 

where I = (l x ,ly,l z ) and Ni is the normalization factor. 
Depending on the angular moment L — l x + l y + l z the 
functions are labeled as as s-type (L=0), p-type (L=l), 
d-type (L=2) and so on. We take the Gaussian width on 
inversely proportional to the square of the covalent radius 
of atom i throughout this work. The general formulas 
for the elements of the overlap matrix are given, e.g., by 
Eq. (3.5) in Ref. Q (a dimensional analysis shows that 
the formulas given in Appendix of Ref. |2j can not be 
true). For convenience, we restate the simplified relations 
for the special cases involving s and p-type GTOs. 

The Gaussian product theorem says that the product 
of two Gaussian functions is again a Gaussian function. 
Therefore the overlap integrals between a pair of GTOs, 
namely 

(4\$) = fdr<j>\{r)$(r) (S3) 

can be evaluated analytically. This gives the normaliza- 
tion factors as 

= [—) (V ly n h ) , n h = 



Note that all GTOs are recursively obtained by differenti- 
ating (j>l(r) = (2ai/ir) 3 / 4 exp(— a,||r — t*,/ 1 1 2 ) with respect 
to atomic positions. For instance 

(r)=2^ i {x-x i mr) = ^ ! ^-. (S4) 

Therefore, the overlap integrals can also be conveniently 
calculated from recursion relations Q. The overlap of 
two s GTOs is spherically symmetric 

s - = <*< M = U+^J exp l-^T^ r «J 

where r.^ = ||rj — r,-||. In contrast, overlaps with higher 
order GTOs are directional dependent. For s-p and p-p 
overlaps, using the last term in Eq. (IS4[) we obtain 

W W = ~{«T+a-Y {Xi ~ Xj)iVi ~ Vj)S « 
and so on. 

Eigenvectors of the overlap matrix (OM) (4>\\4> l j ) give 
maximum-overlap molecular orbitals. A measure of cova- 
lent bondings strength on each molecular orbital is given 
by the corresponding eigenvalue However, those cor- 
responding to tiny eigenvalues (week bonding) are empty 
of information. The strongest bonding corresponds to the 
largest eigenvalue and therefore it is preferred that the 
principal eigenvector is filled with V = (v\, t>2, ■ • • , v n ). 
Each Vi is an atomic fingerprint. In other words, an OM 
eigenvector is a set of coefficients Vi = {c\} that compose 
a molecular orbital \P(r) — J^i i c \ ( t ) \{ r ) where i runs over 
all atoms. For example, the coefficients corresponding to 
a P-tyP e GTOs determine its orientation in the space 
which is ahown as arrows in Fig. [S2] where the colors 
correspond to the coefficient of the maximum-overlap s 
GTOs. Figure IS3l illustrates the charge density \&(r)\ 2 
from maximally overlapped s and p GTOs put on atoms 
of a water molecule. 
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FIG. S3. Pseudocharge density |?/i(r)| 2 obtained from the 
maximally-overlapped Gaussian functions centered on the 
atoms of a water molecule. Contour maps show individual 
contributions from (a) oxygen and (b) hydrogen atoms to (c) 
water molecule projected on the molecule sheet. 



SPRINT can not uniquely describe the chemical environ- 
ment of an atom, since the Si atoms have very different 
environments while some of them have identical SPRINT 
coordinates. For the acetylene molecule a number of con- 
figurations with identical eigenvalues of the Coulomb ma- 
trix is found in Ref. Q . However, by including both s and 
p type GTO's in the OM the number of eigenvalues be- 
come 4n and all of our attempts to minimize Ay(p 1 q) 
either yielded to p=q or failed to reach a small Av- 




FIG. S4. Two distinct configurations of the S15 cluster with 
an identical set of SPRINT coordinates, i.e 3.59 (green), 4.37 
(red), 4.85 (blue). Bondlengths are shown in A. (a) is planner 
and a local minimum in LDA-DFT. 



Satisfying the coincidence axiom 

If ./V is smaller than the internal number of degrees of 
freedom 3n — 6, then V £ M. N is not a complete rep- 
resentation of the configuration R £ R 3n . For exam- 
ple, using any of the n by n matrices, including Laplca- 
ian, SPRINT, Coulomb and overlap of s type GTO's 
matrices, we could always find distinct configurations 
(large RMSD) with vanishing Ay for our benchmarking 
molecules. This is done by minimizing [5] Ay(p, q) with 
respect to R q . Using the parameters given in the Sup- 
plementary Material of Ref. @ , we show in Fig. [S4] two 
distinct configurations of a Sis cluster which have identi- 
cal sets of SPRINT coordinates. Figure [S4Tb) shows that 



Closed-form of superimposing rotation 

A quaternion Q = (Qo , Qi , Q2 , Q3) is an extension 
of the idea of complex numbers to one real (Qo) and 
three imaginary parts. According to the Euler's rotation 
theorem, a rotation in space which keeps one point on the 
rigid body (centroid in our case) fixed, can be represented 
by four real numbers: one for the rotation angle and 
three for the rotation axis (we assume that the center of 
rotation is on the origin). A unit quaternion, i.e. || Q|| 2 = 
Qo + Qi + Q| + Q4 = 1) can represent conveniently this 
axis-angle couple as 
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'Ql + Ql-Ql-Ql IQxQi 

2SlQ 2 + 2QoS 3 Ql-Q\-r^ 2 

2Q1Q3-2Q0Q2 2Q 2 Q 3 + 2QoQi 



2QoQ 3 
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2Q1Q3 + 2Q0Q2 
2Q2Q3-2Q0Q1 



Assume that U is the 3x3 equivalence of the rotation 
operator U in Eq. (2) The optimum rotation 11 which 
minimizes RMSD, indeed maximizes the correlation be- 
tween BP and R q , i.e. the atomic Cartesian coordinates 
with respect to the common center of mass. The Kab- 
sch algorithm |8[ provides one possible solution to this 

I 



T = 



where 1Z is the correlation matrix whose elements are 
x\y\ and so. Note that, without applying U 
on R q , Eq. (2) is then given by 



RMSD(p,q) = 




where A* is the largest eigenvalue of T . 
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